E03 - qPCR (Analysis of Gene Expression @ UCT Prague)


Copy, please, these files and directories to your personal directory:

cp -r ~/shared/AGE_current/Exercises/{E03-qPCR,age_library.R} ~/AGE/Exercises

Assignment

The assignment for this exercise is to implement (mainly visualization) functions, which you will use later on different types of gene expression data. You will implement these functions in age_library.R where are empty function definitions. You don’t have to stick to all predefined parameters and returned variables, just keep in mind that your implementations should make outputs similar to outputs in this document. Also, in several cases there are alternative functions (such as plot_pca() and plot_pca_ggpairs()), but you don’t have to implement them all, if not said.

Sections containing functions to be implemented are marked with πŸ›  Anyway, here is the minimum set of functions you need to implement (arguments are up to you, it’s about the output):

After you implement these functions, modify their calling in this document and render it again. Obviously, you need to remove sections with functions you haven’t implemented.

Please, return the assignment in this form:

Note that your functions will be universal, serving not only for qPCR data. This is mainly true for exploratory analysis functions (hiearchical clustering, PCA, etc.), where input (expression matrix) is the same also for microarray and RNA-seq data.


Introduction

Quantitative real-time PCR monitors the amplification of a targeted DNA molecule during the PCR (i.e., in real time), not at its end, as in conventional PCR.

qPCR measurement

In real time reverse transcription quantitave PCR (RT-qPCR), we measure the amplification of target reverse transcribed DNA molecules. This amplification depends on initial RNA concentration and approximates the gene expression. We select a fluorescence signal threshold \(S_T\) and measure crossing points \(C_P\) (also called threshold cycle \(C_T\)) for samples. To correct for initial RNA concentration and quality, reference genes are used (dashed curves). These genes are considered to be equally expressed in every condition (purple and orange curves), and so serve as the baseline. First, we normalize target genes in each sample to reference gene (or they geometric mean) by calculating \(\Delta C_P = C_P^t - C_P^r\). Obtained \(\Delta C_P\) values are then used to calculate the relative change of gene expression between samples. In this case, we are interested how much more or less is gene expressed in \(\text{treatment}\) sample (\(\text{trt}\)), relative to \(\text{control}\) (\(\text{cnt}\)) sample. This relative change is called \(\text{fold-change}\): \(FC = 2^{-\Delta\Delta C_P} = 2^{-(\Delta C_{P, trt} - \Delta C_{P, cnt})}\). We usually report \(FC\) in \(log_2\) scale: \(LFC = \Delta C_{P, cnt} - \Delta C_{P, trt} = log_2(FC)\) Important is to realize that the higher \(C_P\) is, the lower initial mRNA concentration was. This is why we put minus /\(-\)/ in front of \(\Delta\Delta C_P\). We usually don’t consider the efficiency of PCR and expect the mRNA concentration to double in each cycle. Otherwise \(FC = (1+E)^{-\Delta\Delta C_P}\), where \(E\) is the PCR efficiency.

Our experiment

We will work with data from a pancreatic tumor experiment where an influence of spirulina algae was investigated. In this experiment there are two sample groups:

  • control: samples from tumor.
  • spirulina: samples from tumor treated by extract from spirulina.

Each sample group has three biological replicates (grown on different Petri dish). In addition, each biological replicate is cultivated for 50% and 90% confluence, and for each confluence there are two technical replicates. Uhhh, a nested design, right…

But we can expand it to:

- control group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample C1_C50_R1
  |   |   |-- technical replicate #2 -> sample C2_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample C1_C90_R1
  |       |-- technical replicate #2 -> sample C1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample C2_C50_R1
      |   |-- technical replicate #2 -> sample C2_C50_R2
      |-- confluence 90
          |-- ...

- spirulina group
  |
  |-- biological replicate #1 (Petri dish #1)
  |   |-- confluence 50
  |   |   |-- technical replicate #1 -> sample S1_C50_R1
  |   |   |-- technical replicate #2 -> sample S1_C50_R2
  |   |-- confluence 90
  |       |-- technical replicate #1 -> sample S1_C90_R1
  |       |-- technical replicate #2 -> sample S1_C90_R2
  |-- biological replicate #2 (Petri dish #2)
      |-- confluence 50
      |   |-- technical replicate #1 -> sample S2_C50_R1
      |   |-- technical replicate #2 -> sample S2_C50_R2
      |-- confluence 90
          |-- ...

Also, there are two groups of measured genes:

  1. Target genes - interesting genes found prior by microarray exploratory analysis.
  2. Housekeeping genes - used as reference genes.

Libraries

library(here)
library(tidyverse)
library(dplyr)
library(glue)
library(magrittr)

Now we source your personal library:

source(here("age_library.R"))

Config

I would recommend you to always define constants at the beginning of a script. It is more transparent, and if you, for example, change some path, you haven’t to replace it in whole source code. Constants are, by convention, named in UPPERCASE.

BASE_DIR <- here("E03-qPCR")
CP_DATA <- here(BASE_DIR, "data/qPCR.Rds")

Data preprocessing

\(C_P\) matrix

We have already preprocessed a \(C_P\) matrix for you. But if you really want to start from the floor, you can create the \(C_P\) matrix from original files (here and here). Not very pretty, right? 🀭

In \(C_P\) matrix, columns are samples, rows are genes, and values are \(C_P\) values. This format is generally used for gene expression data and you will see it also in case of microarrays and RNA-seq.

cp <- readRDS(CP_DATA) %>%
  as.data.frame()
genes <- rownames(cp)
samples <- colnames(cp)

cp[1:5, 1:5]

CP values >= 40 are considered out of qPCR limit and so we set them to NA:

cp[cp > 39.9] <- NA

Some genes have so many NA values (respectively, they don’t have many values other than NA):

lq_genes <- rowSums(!is.na(cp))
lq_genes
##     CD24     CD31   CXCL12    CXCR4    CXCR7     FLT1    VEGFA   VEGRF2 
##       24       19        2       10        0        0       24        2 
##  ACTB H4 GAPDH H5 RPS9 H11  TBP H26 
##       24       24       24       24

Let’s remove genes with number of non-NA values less or equal to 2:

lq_genes <- names(lq_genes[lq_genes <= 2])
genes <- setdiff(genes, lq_genes)
cp <- cp[genes, ]

Now we split genes to target and reference ones. The latter have H<number> on the end of their names, which we can match with regular expression (regex) H\d+$.

ref_genes <- str_subset(genes, "H\\d+$")
target_genes <- setdiff(genes, ref_genes)
gene_groups <- if_else(genes %in% ref_genes, "reference", "target")
genes_df <- data.frame(
  gene = as.character(genes),
  gene_group = factor(gene_groups, levels = c("reference", "target")),
  
  row.names = genes,
  stringsAsFactors = FALSE
)
genes_df

For plotting purposes, we create a new dataframe with NAs replaced by 40:

cp_plot <- replace(cp, is.na(cp), 40)

Phenotypical data (sample sheet)

Along with measured data you usually obtain a sample sheet, but in this case we are going to parse the column names of the \(C_P\) matrix. Everything needed is contained there:

  • K and SP: sample groups
  • Number behind sample group: Petri dish (= biological replicate)
  • Number behind slash (β€œ/”): confluence
  • (2RT): second technical replicate

str_match() is returning a character matrix of capturing groups (columns) for each string (rows). The first column is always the full match.

(samples <- colnames(cp))
##  [1] "K1/50"        "K1/50(2RT)"   "K1/90"        "K1/90 (2RT)"  "K2/50"       
##  [6] "K2/50(2RT)"   "K2/90"        "K2/90 (2RT)"  "K3/50"        "K3/50(2RT)"  
## [11] "K3/90"        "K3/90 (2RT)"  "SP1/50"       "SP1/50 (2RT)" "SP1/90"      
## [16] "SP1/90 (2RT)" "SP2/50"       "SP2/50 (2RT)" "SP2/90"       "SP2/90 (2RT)"
## [21] "SP3/50"       "SP3/50(2RT)"  "SP3/90"       "SP3/90 (2RT)"
pheno_data <- data.frame(
  sample_id = samples,
  # Match with a single capturing group (full match): "SP" or "K" on the beginning.
  sample_group_code = str_match(samples, "^SP|K")[, 1] %>% recode_factor("K" = "c", "SP" = "sp"),
  # Match a sample group on the beginning and then the number behind (second group).
  petri_number = str_match(samples, "^[A-Z]{1,2}(\\d)")[, 2] %>% as.factor(),
  # Match a two-digit number behind "/".
  confluence = str_match(samples, "\\/(\\d{2})")[, 2] %>% as.factor(),
  # Detect a second technical replicate. See how literal match of parentheses ("(" and ")") must be written:
  # first you have to escape backslash in R as "\\" to escape parenthese for regex as "\\(", which will match the character "(" literally,
  # e.g. not specifying the regex capturing group.
  replicate = if_else(str_detect(samples, "\\(2RT\\)"), 2L, 1L) %>% as.factor()
) %>%
  dplyr::mutate(
    sample_name_rep = glue("{sample_group_code}{petri_number}_{confluence}_r{replicate}") %>% as.character(),
    sample_name = glue("{sample_group_code}{petri_number}_{confluence}") %>% as.character(),
    sample_group = recode_factor(sample_group_code, "c" = "control", "sp" = "spirulina")
  ) %>%
  dplyr::select(sample_name_rep, sample_name, everything()) %>%
  set_rownames(.$sample_name_rep)

colnames(cp) <- rownames(pheno_data)
colnames(cp_plot) <- rownames(pheno_data)
samples <- colnames(cp)

head(pheno_data)

If you are not sure what the regexes above mean, refer to the stringr / Regular expressions section in E02 - Intro to R.

Long data

We also create long data from cp, pheno_data and genes_df:

cp_long <- as.data.frame(cp) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name_rep", values_to = "cp") %>%
  dplyr::left_join(pheno_data, by = "sample_name_rep") %>%
  dplyr::left_join(genes_df, by = "gene") %>%
  dplyr::mutate(
    cp_plot = if_else(is.na(cp), 40, cp)
  ) %>%
  dplyr::select(sample_name_rep, sample_name, gene, cp, cp_plot, gene_group, everything())

head(cp_long)

Exploratory analysis on raw data

We are using exploratory analysis to have an unbiased first look at data. It is also a crucial step to assess the biological quality control, e.g.Β whether samples from distinct groups create separate clusters - if not, samples might be swapped etc.

Hiearchical clustering πŸ› 

Function to be implemented within the assignment. It takes an expression matrix and returns a dendrogram with coloring by a chosen variable. I recommend you to use the dendextend package to modify a dendrogram, it is much easier than doing so in base R.

coloring_variables <- c("sample_group", "confluence", "petri_number")

for (variable in coloring_variables) {
  plot_hc(
    cp_plot, color_by = pheno_data[, variable],
    method_distance = "euclidean", method_clustering = "complete",
    color_by_lab = variable
  )
}

We can see the main effect - with several exceptions, samples are clustered by sample groups, which is good. There isn’t any effect of other variables, which would be confounding.

PCA πŸ› 

Function to be implemented within the assignment. It will perform PCA visualization with point coloring by a chosen variable. You can also output plots of first N principal components and variance explained bar plot.

We can see that the first principal component (PC1) is nicely separating our samples by sample group:

plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence")$plot

plot_pca(cp_plot, pheno_data, plot_type = "multi", color_by = "sample_group", shape_by = "confluence")$plot

plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name_rep")$plot

plot_pca(cp_plot, pheno_data, color_by = "sample_group", shape_by = "petri_number")$plot

  • You can also use GGally::ggpairs() to plot a grid of PCA plots (PC1 vs.Β PC2, PC1 vs.Β PC3, PC2 vs.Β PC3, etc.):
plot_pca_ggpairs(cp_plot, pheno_data, n_components = 5, color_by = "sample_group", shape_by = "confluence")$plot

Heatmap πŸ› 

Function to be implemented within the assignment. ComplexHeatmap package is preferred. For interactive heatmaps you can use heatmaply (introduced in E02 - Intro to R).

p_heatmap <- plot_heatmap(
  cp,
  column_annotation = dplyr::select(pheno_data, `Sample Group` = sample_group, `Petri Dish` = petri_number, Confluence = confluence, Replicate = replicate),
  row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group),
  legend_title = "Cp",
  show_column_names = FALSE
)
draw(p_heatmap, merge_legend = TRUE)

Can you see any outlying samples (replicates)? If yes, should they be removed?

Selecting the reference genes

We will use housekeeping genes as reference genes. Housekeeping genes are genes which have similar expression profile regardless the cell type or state. That is why they can be used for sample normalization (as an internal control).

Clustering

Housekeeping genes should cluster together. Let’s try it using hierachical clustering:

plot_hc(t(cp), color_by = gene_groups, color_by_lab = "Gene groups")

Also in PCA, we should see them separated from target genes:

plot_pca(t(cp_plot), sample_data = genes_df, color_by = "gene_group")$plot

Not very sharp clustering, but PC1 separates the gene groups well.

Correlation

Expression of housekeeping genes is/should be correlated.

main_title <- "Raw data correlation"

t(cp) %>%
  as.data.frame() %>%
  GGally::ggpairs(progress = FALSE) +
  theme_bw()

Let’s look more closely at the correlation coefficients:

(gene_cor <- cor(t(cp), use = "pairwise.complete.obs"))
##                 CD24        CD31       CXCR4       VEGFA     ACTB H4
## CD24      1.00000000  0.03970592 -0.28809768  0.73649418  0.73547823
## CD31      0.03970592  1.00000000 -0.73606175  0.03747391  0.03172092
## CXCR4    -0.28809768 -0.73606175  1.00000000 -0.20149851 -0.18053811
## VEGFA     0.73649418  0.03747391 -0.20149851  1.00000000  0.77468336
## ACTB H4   0.73547823  0.03172092 -0.18053811  0.77468336  1.00000000
## GAPDH H5  0.73141005  0.17532289 -0.02185175  0.82725327  0.92233124
## RPS9 H11  0.76319246  0.10746231  0.08141219  0.83395485  0.89743784
## TBP H26   0.72419041  0.08438151 -0.25461239  0.77534719  0.93865311
##             GAPDH H5   RPS9 H11     TBP H26
## CD24      0.73141005 0.76319246  0.72419041
## CD31      0.17532289 0.10746231  0.08438151
## CXCR4    -0.02185175 0.08141219 -0.25461239
## VEGFA     0.82725327 0.83395485  0.77534719
## ACTB H4   0.92233124 0.89743784  0.93865311
## GAPDH H5  1.00000000 0.96305770  0.93496580
## RPS9 H11  0.96305770 1.00000000  0.90135469
## TBP H26   0.93496580 0.90135469  1.00000000
ggcorrplot::ggcorrplot(gene_cor, method = "circle") +
  ggtitle(main_title)

You can also use the heatmaply package:

plot_heatmaply(gene_cor, row_annotation = dplyr::select(genes_df, `Gene Group` = gene_group), main = main_title, legend_title = "Correlation")

At the first sight, housekeeping (reference) genes are well correlated, with correlation coefficients over 0.9

Housekeeping genes should cluster together also by their correlation coefficients. We can try both Euclidean and Pearson distance (1 - correlation coefficient) for hiearchical clustering.

plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")

plot_hc(gene_cor, color_by = gene_groups, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")

PCA of correlation coefficients is also helpful:

plot_pca(gene_cor, sample_data = genes_df, color_by = "gene_group")$plot

From both hiearchical clustering and PCA you can see that reference genes make a compact cluster, which is good.

geNorm M-values πŸ› 

Function (compute_m()) to be implemented within the assignment. Pseudocode can be found in documentation for this function in age_library.R. Your function should output values similar to those below (or at least numerically very close to them).

Until now we were doing rather exploratory analysis of reference genes, but it’s time to make a more rigorous decision. M values are telling you how stable is a gene expressed across the samples. From the original paper:

β€œThis measure relies on the principle that the expression ratio of two ideal internal control genes is identical in all samples, regardless of the experimental condition or cell type. In this way, variation of the expression ratios of two real-life housekeeping genes reflects the fact that one (or both) of the genes is (are) not constantly expressed, with increasing variation in ratio corresponding to decreasing expression stability.”

For a further explanation see the section β€œGene-stability measure and ranking of selected housekeeping genes” in the paper above.

Hints for geNorm implementation:

  • Remember that R is vectorized. In this way you can directly write most of equations as you see them on paper.
  • You will most likely need the apply() function.

Let’s compute the M values:

purrr::map_dbl(ref_genes, ~ compute_m(., cp_plot[ref_genes, ])) %>%
  set_names(ref_genes)
##   ACTB H4  GAPDH H5  RPS9 H11   TBP H26 
## 0.5875240 0.6759988 0.7351183 0.5453682

All reference genes seem to be OK. But compare them with target genes - they would not be good reference genes:

purrr::map_dbl(target_genes, ~ compute_m(., cp_plot[target_genes, ])) %>%
  set_names(target_genes)
##     CD24     CD31    CXCR4    VEGFA 
## 1.622029 2.503237 2.180548 1.592555

Normalization to reference genes: \(\Delta C_P\)

For each sample we calculate a geometric mean of reference genes. Don’t be confused, it’s just this formula applied to reference genes of each sample.

norm <- apply(cp[ref_genes, ], 2, function(x) {
  log(x) %>% mean() %>% exp()
})

head(norm)
## c1_50_r1 c1_50_r2 c1_90_r1 c1_90_r2 c2_50_r1 c2_50_r2 
## 23.69656 23.90002 24.05644 24.08679 23.30099 23.48331

Let’s look how these geometric means of reference genes behave. As you can see, they lie in the centre of reference genes, and that’s what we want (and expect).

gene_cor_norm <- rbind(cp, norm = norm) %>%
  t() %>%
  cor(use = "pairwise.complete.obs")

genes_df_norm <- genes_df %>%
  dplyr::mutate(gene_group = factor(gene_group, levels = c("reference", "target", "norm"))) %>%
  rbind(norm = c("norm", "norm"))

ggcorrplot::ggcorrplot(gene_cor_norm, method = "circle") +
  ggtitle(main_title)

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "euclidean", title = "Hierarchical clustering (Euclidean distance)")

plot_hc(gene_cor_norm, color_by = genes_df_norm$gene_group, color_by_lab = "Gene groups", method_distance = "pearson", title = "Hierarchical clustering (Pearson distance)")

plot_pca(gene_cor_norm, genes_df_norm, color_by = "gene_group")$plot

Now we calculate \(\Delta C_P\) values by subtracting the reference gene means from target genes. First, we create a matrix of the same size as cp, with reference gene mean of each sample in columns.

ref_gene_cp_matrix <- matrix(norm, ncol = ncol(cp), nrow = nrow(cp), byrow = TRUE)
ref_gene_cp_matrix[1:5, 1:5]
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [2,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [3,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [4,] 23.69656 23.90002 24.05644 24.08679 23.30099
## [5,] 23.69656 23.90002 24.05644 24.08679 23.30099

Now we are ready for subtraction, that is, to calculate \(\Delta C_P = C_P^t - C_{P,mean}^r\):

delta_cp <- cp - ref_gene_cp_matrix
delta_cp[1:5, 1:5]

Now we average the technical replicates. We will use a handy function from limma package, which we will use later for microarray data analysis. avearrays() is basically doing this: splits columns to groups, calculates within group average of each gene and concatenates the results to matrix or dataframe. We split our samples by sample_name column in pheno_data. You can see that for each unique sample_name, there is either replicate 1 or 2.

head(pheno_data)
delta_cp_avg <- limma::avearrays(delta_cp, pheno_data$sample_name)
delta_cp_avg[1:5, 1:5]
##            c1_50     c1_90    c2_50     c2_90     c3_50
## CD24     8.09171  8.063385  8.59785  8.619696  8.614854
## CD31    12.29171 10.143385 11.40785 11.369696 10.450404
## CXCR4   12.12998       NaN 13.63669 13.757579 14.114854
## VEGFA    9.18171  8.678385  9.46285  8.759696  8.794854
## ACTB H4  1.59171  1.503385  1.78785  1.944696  1.879854

Just for control:

(rowMeans(delta_cp[, c("c1_50_r1", "c1_50_r2")], na.rm = TRUE) == delta_cp_avg[, "c1_50"]) %>%
  all() %>%
  stopifnot()

Let’s finalize the \(\Delta C_P\) calculation. We keep only target genes and again, create a separate matrix for plotting purposes. We also add \(\Delta C_P\) to our long data.

delta_cp_avg <- delta_cp_avg[target_genes, ]
delta_cp_avg_plot <- replace(delta_cp_avg, is.na(delta_cp_avg), ceiling(max(delta_cp_avg, na.rm = TRUE) + 1))

Remove replicates from pheno_data and cp_long:

pheno_data_avg <- dplyr::filter(pheno_data, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything()) %>%
  set_rownames(.$sample_name)

cp_long <- dplyr::filter(cp_long, replicate == 1) %>%
  dplyr::select(-sample_name_rep, -sample_id, -replicate) %>%
  dplyr::select(sample_name, everything())

And add computed \(\Delta C_P\) to long data:

cp_long <- as.data.frame(delta_cp_avg) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

cp_long <- as.data.frame(delta_cp_avg_plot) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample_name", values_to = "delta_cp_plot") %>%
  dplyr::right_join(cp_long, by = c("sample_name", "gene"))

cp_long_target <- dplyr::filter(cp_long, gene_group == "target")

head(cp_long_target)

As you can see, the code above is quite repetitive - better would be to write a function for that πŸ™‚


Exploratory analysis on summarised and normalised data

Now when our data are normalized, we can check for sample outliers, as these could mess up differential expression analysis.

plot_pca(delta_cp_avg_plot, pheno_data_avg, color_by = "sample_group", shape_by = "confluence", label_by = "sample_name")$plot

for (variable in coloring_variables) {
  plot_hc(
    delta_cp_avg_plot,
    color_by = pheno_data_avg[, variable],
    method_distance = "euclidean",
    method_clustering = "complete",
    color_by_lab = variable
  )
}

plot_heatmap(delta_cp_avg_plot, column_annotation = pheno_data_avg[, coloring_variables], title = "qPCR", legend_title = "deltaCp") %>%
  draw(merge_legend = TRUE)

We can see that samples c1_90, sp3_50 and sp1_90 are a little bit suspicious. Stricter people would consider their removal, but that is the last solution if final results are not satisfactory. In biology, one has to always keep in mind its natural variation.


Calculating (\(log_2\)) fold-changes: \(-\Delta\Delta C_P\)

Now we calculate the relative change in gene expression between sample groups. For the sake of simplicity, we don’t take into account other variables, e.g.Β confluence. Also from the exploratory analysis you can see that these variables don’t have much effect on the separation of samples - sample group has the largest influence.

First, we calculate the \(mean(C_P)\) of control samples:

(control_means <- cp_long_target %>%
  dplyr::filter(sample_group == "control") %>%
  dplyr::group_by(gene) %>%
  dplyr::summarise(delta_cp_control_mean = mean(delta_cp, na.rm = TRUE))
)

Then we join this table with our long data and calculate:

  • \(\Delta \Delta C_P = \Delta C_{P, treatment} - \text{mean}(\Delta C_{P, control})\)
  • \(log_2(\text{fold-change}) = -\Delta \Delta C_P\)
  • \(\text{fold-change} = 2^{LFC}\)
cp_long_target <- cp_long_target %>%
  dplyr::left_join(control_means, by = "gene") %>%
  dplyr::mutate(
    delta_delta_cp = delta_cp - delta_cp_control_mean,
    lfc = -delta_delta_cp,
    fc = 2^lfc
  ) %>%
  dplyr::select(sample_name, gene, cp, cp_plot, delta_cp, delta_cp_plot, delta_cp_control_mean, delta_delta_cp, lfc, fc, everything())

head(cp_long_target)

\(\text{mean}(\Delta \Delta C_{P, control})\) is now \(0\), because we subtracted \(\text{mean}(\Delta C_P, control)\) also from the control samples. Of course this also applies for \(LFC\), which is just \(-\Delta \Delta C_P\). As you will see, this is useful for plotting (boxplots).

Boxplots πŸ› 

Boxplots are very informative when you want to see a difference between groups.

Function to be implemented within the assignment. It should work for any type of long data, as you will be able to specify the grouping column to split boxplots on x-axe, and column with values to plot on y-axe. Tip: you can use ggpubr::ggboxplot()

First we will plot the \(\Delta C_P\) values:

plot_boxplots(
    cp_long_target,
    x = "sample_group",
    y = "delta_cp",
    facet_by = "gene",
    color_by = "sample_group",
    main = "delta Cp values",
    y_lab = "delta Cp"
)

And this is tricky to interpret at the first sight, because in spirulina group, all genes are actually upregulated! Also, you cannot directly see the \(LFCs\) - in your head, you have to calculate the difference between spirulina and control and flip the sign.

So better is to plot the \(LFCs\). Why not the \(\text{fold-changes}\)? Because \(LFCs\) are easier to read, as changes (relative to controls) are multiplicative and symmetric:

\(\text{fold-change}\) 0.125 0.25 0.5 1 2 4 8
\(log_2(\text{fold-change})\) -3 -2 -1 0 1 2 3
You read: Change is… 8x less 4x less 2x less no change 2x more 4x more 8x more

This figure summarises how \(LFCs\) were calculated and why it is better for boxplots πŸ€“

Now you can clearly see that spirulina group have upregulated genes:

plot_boxplots(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    facet_by = "gene",
    color_by = "sample_group",
    main = "log2 fold-changes",
    y_lab = "log2 fold-change"
)

We can also split plots by confluence:

plot_boxplots(
    cp_long_target,
    x = "sample_group",
    y = "lfc",
    facet_by = "gene",
    color_by = "confluence",
    main = "log2 fold-changes",
    y_lab = "log2 fold-change"
)

Alternatively to boxplots, we can make barplots, which are quite popular in papers πŸ™‚

ggpubr::ggbarplot(
  cp_long_target,
  x = "sample_group",
  y = "fc",
  add = c("mean", "jitter"),
  fill = "sample_group",
  facet.by = "gene"
) +
  scale_y_continuous("Relative concentration", labels = scales::percent) +
  labs(x = NULL, fill = "Sample Group")

Or we can plot a heatmap of LFC values or their z-scores:

lfc_wide <- cp_long_target %>%
  dplyr::select(sample_name, gene, lfc) %>%
  pivot_wider(names_from = sample_name, values_from = lfc) %>%
  column_to_rownames(var = "gene")
lfc_wide[, 1:5]
plot_heatmap(
  lfc_wide,
  column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
  title = "LFC between spirulina and control",
  legend_title = "LFC",
  cluster_rows = FALSE,
  cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)

plot_heatmap(
  lfc_wide,
  z_score = TRUE,
  column_annotation = dplyr::select(pheno_data_avg, `Sample Group` = sample_group),
  title = "LFC between spirulina and control",
  legend_title = "LFC z-score",
  cluster_rows = FALSE,
  cluster_columns = FALSE
) %>% draw(merge_legend = TRUE)

t-test πŸ› 

Functions to be implemented within the assignment. You should implement both test_gene() and test_gene_table(). If you do it smart, you can just call the former from the latter. Hint: look at the formula parameter of t.test() function.

t-test is testing whether two means, coming from samples having Student’s distribution, significantly differ. t-test assumes that the input \(C_P\) values are normally distributed and the variance between conditions is comparable. Wilcoxon test can be used when sample size is small and those two last assumptions are hard to achieve.

As we will use two-sided alternative (i.e.Β gene is up- or downregulated = deregulated), it doesn’t matter which one of \(\Delta C_P\), \(\Delta \Delta C_P\) or \(LFC\) we will use. In all cases, the absolute difference between group means remains the same:

group_means <- dplyr::filter(cp_long_target, gene == "CD24") %>%
  dplyr::group_by(sample_group) %>%
  dplyr::summarise(
    delta_cp_mean = mean(delta_cp),
    delta_delta_cp_mean = mean(delta_delta_cp),
    lfc_mean = mean(lfc)
  ) %>%
  as.data.frame() %>%
  tibble::column_to_rownames("sample_group")

group_means
apply(group_means, MARGIN = 2, FUN = function(x) abs(x[1] - x[2]))
##       delta_cp_mean delta_delta_cp_mean            lfc_mean 
##            1.057847            1.057847            1.057847
for (g in target_genes) {
  test_gene(
    gene = g,
    gene_data = cp_long_target,
    gene_col = "gene",
    value_col = "lfc",
    group_col = "sample_group",
    test = t.test,
    verbose = TRUE
  )
}
## ╔══════════════════════════════╗
## β•‘ CD24: spirulina vs. control  β•‘
## β•‘ t.test p-value: 0.000959 *** β•‘
## β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
## ╔═════════════════════════════╗
## β•‘ CD31: spirulina vs. control β•‘
## β•‘ t.test p-value: 0.00535 **  β•‘
## β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
## ╔══════════════════════════════╗
## β•‘ CXCR4: spirulina vs. control β•‘
## β•‘ t.test p-value: 0.168 NS     β•‘
## β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
## ╔══════════════════════════════╗
## β•‘ VEGFA: spirulina vs. control β•‘
## β•‘ t.test p-value: 0.00234 **   β•‘
## β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

But better is to return a table for all genes:

test_table <- test_gene_table(
  gene_data = cp_long_target,
  gene_col = "gene",
  value_col = "delta_cp",
  group_col = "sample_group"
)
test_table

Cleanup

save(list = ls(all.names = TRUE), file = here(BASE_DIR, "qPCR.RData"), envir = environment())

warnings()
traceback()
## No traceback available
sessioninfo::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Debian GNU/Linux 10 (buster)
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Europe/Prague
##  date     2022-02-16
##  pandoc   2.11.2 @ /usr/lib/rstudio-server/bin/pandoc/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package        * version    date (UTC) lib source
##  abind            1.4-5      2016-07-21 [1] CRAN (R 4.1.2)
##  assertthat       0.2.1      2019-03-21 [1] CRAN (R 4.1.2)
##  backports        1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
##  BiocGenerics     0.40.0     2021-10-26 [1] Bioconductor
##  bookdown         0.24       2021-09-02 [1] CRAN (R 4.1.1)
##  broom            0.7.12     2022-01-28 [1] CRAN (R 4.1.2)
##  bslib            0.3.1      2021-10-06 [1] CRAN (R 4.1.2)
##  car              3.0-12     2021-11-06 [1] CRAN (R 4.1.2)
##  carData          3.0-5      2022-01-06 [1] CRAN (R 4.1.2)
##  cellranger       1.1.0      2016-07-27 [1] CRAN (R 4.1.2)
##  circlize         0.4.13     2021-06-09 [1] CRAN (R 4.1.0)
##  cli              3.1.1      2022-01-20 [1] CRAN (R 4.1.2)
##  clue             0.3-60     2021-10-11 [1] CRAN (R 4.1.2)
##  cluster          2.1.2      2021-04-17 [3] CRAN (R 4.1.1)
##  codetools        0.2-18     2020-11-04 [1] CRAN (R 4.1.2)
##  colorspace       2.0-2      2021-06-24 [1] CRAN (R 4.1.0)
##  ComplexHeatmap * 2.10.0     2021-10-26 [1] Bioconductor
##  crayon           1.4.2      2021-10-29 [1] CRAN (R 4.1.2)
##  crosstalk        1.2.0      2021-11-04 [1] CRAN (R 4.1.2)
##  data.table       1.14.2     2021-09-27 [1] CRAN (R 4.1.1)
##  DBI              1.1.2      2021-12-20 [1] CRAN (R 4.1.2)
##  dbplyr           2.1.1      2021-04-06 [1] CRAN (R 4.1.2)
##  dendextend       1.15.2     2021-10-28 [1] CRAN (R 4.1.2)
##  digest           0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
##  doParallel       1.0.17     2022-02-07 [1] CRAN (R 4.1.2)
##  dplyr          * 1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
##  ellipsis         0.3.2      2021-04-29 [1] CRAN (R 4.1.2)
##  emo              0.0.0.9000 2021-02-26 [1] Github (hadley/emo@3f03b11)
##  evaluate         0.14       2019-05-28 [1] CRAN (R 4.1.2)
##  fansi            1.0.2      2022-01-14 [1] CRAN (R 4.1.2)
##  farver           2.1.0      2021-02-28 [1] CRAN (R 4.1.2)
##  fastmap          1.1.0      2021-01-25 [1] CRAN (R 4.1.2)
##  forcats        * 0.5.1      2021-01-27 [1] CRAN (R 4.1.2)
##  foreach          1.5.2      2022-02-02 [1] CRAN (R 4.1.2)
##  friendlyeval   * 0.1.1      2021-02-26 [1] Github (milesmcbain/friendlyeval@7ea2ed9)
##  fs               1.5.2      2021-12-08 [1] CRAN (R 4.1.2)
##  generics         0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
##  GetoptLong       1.0.5      2020-12-15 [1] CRAN (R 4.1.2)
##  GGally           2.1.2      2021-06-21 [1] CRAN (R 4.1.0)
##  ggcorrplot       0.1.3      2019-05-19 [1] CRAN (R 4.1.2)
##  ggplot2        * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)
##  ggpubr           0.4.0      2020-06-27 [1] CRAN (R 4.1.2)
##  ggrepel          0.9.1      2021-01-15 [1] CRAN (R 4.1.2)
##  ggsignif         0.6.3      2021-09-09 [1] CRAN (R 4.1.1)
##  GlobalOptions    0.1.2      2020-06-10 [1] CRAN (R 4.1.2)
##  glue           * 1.6.1      2022-01-22 [1] CRAN (R 4.1.2)
##  gridExtra        2.3        2017-09-09 [1] CRAN (R 4.1.2)
##  gtable           0.3.0      2019-03-25 [1] CRAN (R 4.1.2)
##  haven            2.4.3      2021-08-04 [1] CRAN (R 4.1.0)
##  heatmaply        1.3.0      2021-10-09 [1] CRAN (R 4.1.2)
##  here           * 1.0.1      2020-12-13 [1] CRAN (R 4.1.2)
##  highr            0.9        2021-04-16 [1] CRAN (R 4.1.2)
##  hms              1.1.1      2021-09-26 [1] CRAN (R 4.1.1)
##  htmltools        0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
##  htmlwidgets      1.5.4      2021-09-08 [1] CRAN (R 4.1.1)
##  httr             1.4.2      2020-07-20 [1] CRAN (R 4.1.2)
##  IRanges          2.28.0     2021-10-26 [1] Bioconductor
##  iterators        1.0.14     2022-02-05 [1] CRAN (R 4.1.2)
##  jquerylib        0.1.4      2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite         1.7.3      2022-01-17 [1] CRAN (R 4.1.2)
##  klippy         * 0.0.0.9500 2021-02-26 [1] Github (rlesur/klippy@378c247)
##  knitr            1.37       2021-12-16 [1] CRAN (R 4.1.2)
##  labeling         0.4.2      2020-10-20 [1] CRAN (R 4.1.2)
##  lazyeval         0.2.2      2019-03-15 [1] CRAN (R 4.1.2)
##  lifecycle        1.0.1      2021-09-24 [1] CRAN (R 4.1.1)
##  limma            3.50.0     2021-10-26 [1] Bioconductor
##  lubridate        1.8.0      2021-10-07 [1] CRAN (R 4.1.2)
##  magick           2.7.3      2021-08-18 [1] CRAN (R 4.1.1)
##  magrittr       * 2.0.2      2022-01-26 [1] CRAN (R 4.1.2)
##  matrixStats      0.61.0     2021-09-17 [1] CRAN (R 4.1.1)
##  modelr           0.1.8      2020-05-19 [1] CRAN (R 4.1.2)
##  munsell          0.5.0      2018-06-12 [1] CRAN (R 4.1.2)
##  patchwork      * 1.1.1      2020-12-17 [1] CRAN (R 4.1.2)
##  pillar           1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
##  pkgconfig        2.0.3      2019-09-22 [1] CRAN (R 4.1.2)
##  plotly           4.10.0     2021-10-09 [1] CRAN (R 4.1.2)
##  plyr             1.8.6      2020-03-03 [1] CRAN (R 4.1.2)
##  png              0.1-7      2013-12-03 [1] CRAN (R 4.1.2)
##  purrr          * 0.3.4      2020-04-17 [1] CRAN (R 4.1.2)
##  R6               2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
##  RColorBrewer   * 1.1-2      2014-12-07 [1] CRAN (R 4.1.2)
##  Rcpp             1.0.8      2022-01-13 [1] CRAN (R 4.1.2)
##  readr          * 2.1.2      2022-01-30 [1] CRAN (R 4.1.2)
##  readxl           1.3.1      2019-03-13 [1] CRAN (R 4.1.2)
##  registry         0.5-1      2019-03-05 [1] CRAN (R 4.1.2)
##  reprex           2.0.1      2021-08-05 [1] CRAN (R 4.1.0)
##  reshape          0.8.8      2018-10-23 [1] CRAN (R 4.1.2)
##  reshape2         1.4.4      2020-04-09 [1] CRAN (R 4.1.2)
##  rjson            0.2.21     2022-01-09 [1] CRAN (R 4.1.2)
##  rlang            1.0.1      2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown        2.11       2021-09-14 [1] CRAN (R 4.1.1)
##  rmdformats       1.0.3      2021-10-06 [1] CRAN (R 4.1.2)
##  rprojroot        2.0.2      2020-11-15 [1] CRAN (R 4.1.2)
##  rstatix          0.7.0      2021-02-13 [1] CRAN (R 4.1.2)
##  rstudioapi       0.13       2020-11-12 [1] CRAN (R 4.1.2)
##  rvest            1.0.2      2021-10-16 [1] CRAN (R 4.1.2)
##  S4Vectors        0.32.3     2021-11-21 [1] Bioconductor
##  sass             0.4.0      2021-05-12 [1] CRAN (R 4.1.0)
##  scales           1.1.1      2020-05-11 [1] CRAN (R 4.1.2)
##  seriation        1.3.1      2021-10-16 [1] CRAN (R 4.1.2)
##  sessioninfo      1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
##  shape            1.4.6      2021-05-19 [1] CRAN (R 4.1.0)
##  stringi          1.7.6      2021-11-29 [1] CRAN (R 4.1.2)
##  stringr        * 1.4.0      2019-02-10 [1] CRAN (R 4.1.2)
##  tibble         * 3.1.6      2021-11-07 [1] CRAN (R 4.1.2)
##  tidyr          * 1.2.0      2022-02-01 [1] CRAN (R 4.1.2)
##  tidyselect       1.1.1      2021-04-30 [1] CRAN (R 4.1.2)
##  tidyverse      * 1.3.1      2021-04-15 [1] CRAN (R 4.1.2)
##  TSP              1.1-11     2021-10-06 [1] CRAN (R 4.1.2)
##  tzdb             0.2.0      2021-10-27 [1] CRAN (R 4.1.2)
##  utf8             1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
##  vctrs            0.3.8      2021-04-29 [1] CRAN (R 4.1.2)
##  viridis          0.6.2      2021-10-13 [1] CRAN (R 4.1.2)
##  viridisLite      0.4.0      2021-04-13 [1] CRAN (R 4.1.2)
##  webshot          0.5.2      2019-11-22 [1] CRAN (R 4.1.2)
##  withr            2.4.3      2021-11-30 [1] CRAN (R 4.1.2)
##  xfun             0.29       2021-12-14 [1] CRAN (R 4.1.2)
##  xml2             1.3.3      2021-11-30 [1] CRAN (R 4.1.2)
##  yaml             2.2.2      2022-01-25 [1] CRAN (R 4.1.2)
## 
##  [1] /usr/local/lib/R/site-library
##  [2] /usr/lib/R/site-library
##  [3] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────


HTML rendering

This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell 🀯

library(conflicted)
library(knitr)
library(glue)
library(here)

if (!require(rmdformats)) {
  BiocManager::install("rmdformats")
}

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE, eval = TRUE)
rmarkdown::render(
  here("E03-qPCR/qPCR.Rmd"),
  output_file = here("E03-qPCR/qPCR.html"),
  envir = new.env(),
  knit_root_dir = here()
)